Feat/backend #4

Merged
Sam-24-dev merged 3 commits into main from feat/backend on Feb 23, 2026

@Sam-24-dev

Owner

This pull request significantly restructures and enhances the weekly ETL pipeline workflow, improves environment configuration options, and updates CI/CD triggers and dependency audit handling. The main focus is on modularizing ETL jobs by data source, improving artifact management, and adding robust validation and publishing steps. Additionally, several environment variables and workflow triggers have been updated for better flexibility and reliability.

ETL Pipeline Refactor and Enhancement:

  • The .github/workflows/etl_semanal.yml workflow is fully modularized: each data source (GitHub, StackOverflow, Reddit) now runs in its own job, with artifacts uploaded and aggregated in a dedicated aggregation job. This improves parallelism, error isolation, and maintainability. [1] [2] [3]
  • The aggregation job performs artifact handoff, validates required and optional outputs, runs quality gates, and uploads aggregate artifacts for downstream publishing.
  • The publish job restores and commits only changed data, with clear English summaries and improved commit messages.

Environment and Configuration Improvements:

  • New environment variables for controlling data write strategies and trend score engine selection are added to .env.example and set in the ETL workflow, enabling more flexible and transparent configuration. [1] [2]

CI/CD Workflow Updates:

  • CI and dependency security workflows now trigger on relevant branches (main, feat/backend, feat/frontend), ensuring checks are run for active development streams. [1] [2]
  • The dependency audit step in .github/workflows/dependency_security.yml now ignores a known NLTK vulnerability (CVE-2025-14009) until a fix is available, preventing unnecessary pipeline failures.

Deployment Workflow Safeguard:

  • The frontend deployment workflow now only runs for successful workflow runs on the main branch, reducing the risk of unintended deployments.

Most important changes:

ETL Pipeline Modularization and Validation

  • Refactored .github/workflows/etl_semanal.yml to run ETL jobs for GitHub, StackOverflow, and Reddit as separate jobs, each uploading its own artifacts, followed by an aggregation job that validates outputs, runs quality gates, and uploads aggregate artifacts for publishing. [1] [2] [3]
  • Added robust artifact handoff and validation steps to ensure all required and optional data files are present before proceeding, with clear error and warning reporting.
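
A minimal sketch of the required/optional artifact check described above, assuming a hypothetical `check_artifacts` helper and made-up file names (`github_repos.csv`, `stackoverflow_questions.csv`, `reddit_posts.csv`). The actual workflow steps are not shown in this thread, so this only illustrates the idea that missing required files are errors while missing optional files are warnings.

```python
from pathlib import Path
import tempfile


def check_artifacts(root, required, optional):
    """Return (errors, warnings) for artifact files expected under root.

    Hypothetical helper: a required file that is missing or empty is an
    error; a missing optional file only produces a warning.
    """
    root = Path(root)
    errors, warnings = [], []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            errors.append(f"required artifact missing or empty: {name}")
    for name in optional:
        if not (root / name).is_file():
            warnings.append(f"optional artifact missing: {name}")
    return errors, warnings


# Demo with a temporary directory standing in for the downloaded artifacts.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "github_repos.csv").write_text("name,stars\nexample,1\n")
    errors, warnings = check_artifacts(
        tmp,
        required=["github_repos.csv", "stackoverflow_questions.csv"],
        optional=["reddit_posts.csv"],
    )

for message in errors:
    print(f"ERROR: {message}")
for message in warnings:
    print(f"WARNING: {message}")
```

An aggregation step built this way could exit non-zero only when `errors` is non-empty, so an absent optional source would not block publishing.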

Configuration and Environment

  • Introduced new environment variables in .env.example and set them in the ETL workflow for data write strategies (DATA_WRITE_LEGACY_CSV, etc.) and the trend score engine selector (TREND_SCORE_ENGINE), allowing for granular control of ETL outputs. [1] [2]
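
As a rough illustration of how such flags might be read, the sketch below parses `DATA_WRITE_LEGACY_CSV` and `TREND_SCORE_ENGINE` from the environment. The variable names come from the description above; the helper names, the accepted truthy strings, the defaults, and the engine values `pandas` and `duckdb` are assumptions, not the project's actual settings code.

```python
import os

# Assumed engine names; the PR only says the selector chooses between
# a legacy pandas engine and a DuckDB engine.
KNOWN_ENGINES = {"pandas", "duckdb"}


def env_flag(name, default, environ=None):
    """Read a boolean flag, falling back to default when unset or blank."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def trend_score_engine(environ=None):
    """Return the selected engine name, rejecting unknown values early."""
    environ = os.environ if environ is None else environ
    engine = environ.get("TREND_SCORE_ENGINE", "pandas").strip().lower()
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"unsupported TREND_SCORE_ENGINE: {engine!r}")
    return engine


# Example with an explicit mapping instead of the real process environment.
example_env = {"DATA_WRITE_LEGACY_CSV": "0", "TREND_SCORE_ENGINE": "DuckDB"}
print(env_flag("DATA_WRITE_LEGACY_CSV", True, example_env))
print(trend_score_engine(example_env))
```

Passing the environment as a parameter keeps the parsing testable without mutating `os.environ`.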

CI/CD and Audit Workflow Improvements

  • Updated CI and dependency audit workflows to trigger on feat/backend and feat/frontend branches, ensuring active feature branches are tested and audited. [1] [2]
  • The dependency audit step now ignores CVE-2025-14009 for NLTK, preventing unnecessary failures until a fix is available.

Deployment Workflow Safeguard

  • The frontend deployment workflow is now restricted to only trigger on successful runs from the main branch, reducing accidental deployments from other branches.

User-facing Improvements

  • All ETL and publish job summaries, commit messages, and error outputs are now in clear English, improving clarity for contributors and reviewers.

Copilot AI review requested due to automatic review settings February 23, 2026 02:16
@Sam-24-dev Sam-24-dev merged commit c9ea6c5 into main Feb 23, 2026
6 checks passed

Copilot AI left a comment

Pull request overview

This pull request implements a comprehensive V2 backend refactoring for the Technology Trend Analysis Platform, transitioning from a monolithic CSV-only pipeline to a modular, serverless data stack with enhanced quality controls, dual-write capabilities, and frontend bridge support. The changes span 59 files with significant architectural improvements while maintaining backward compatibility.

Changes:

  • Modularized ETL pipeline with parallel GitHub Actions jobs (GitHub, StackOverflow, Reddit) using artifact-based handoff and aggregation
  • Implemented dual-write storage strategy supporting legacy CSV, latest snapshots, and date-partitioned history with configurable environment flags
  • Added severity-based quality gate system with Pandera integration supporting critical/warning/info levels and degradation policies for partial source failures
  • Introduced DuckDB-based Trend Score engine with equivalence tests validating numeric parity with legacy pandas implementation
  • Created data product contract system with run/dataset manifests, SemVer versioning, and deterministic schema hashing
  • Implemented frontend bridge JSON export for historical trend data with feature flag-based partial cutover and CSV fallback
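
The severity-based gate in the list above could look roughly like the following sketch. The three severity levels come from the overview; the `QualityIssue` class, the `gate_status` function, and the status strings `pass` and `pass_with_warnings` are hypothetical (only `fail` appears in the reviewed code), and the Pandera integration itself is not modelled here.

```python
from dataclasses import dataclass

# Severity levels named in the overview, ranked from least to most severe.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class QualityIssue:
    dataset: str
    severity: str  # one of: "info", "warning", "critical"
    message: str


def gate_status(issues):
    """Collapse a list of issues into one gate status (hypothetical names)."""
    worst = max((SEVERITY_RANK[issue.severity] for issue in issues), default=0)
    if worst == SEVERITY_RANK["critical"]:
        return "fail"
    if worst == SEVERITY_RANK["warning"]:
        return "pass_with_warnings"
    return "pass"


issues = [
    QualityIssue("reddit_posts", "warning", "source unavailable for this run"),
    QualityIssue("github_repos", "info", "row count changed since last run"),
]
status = gate_status(issues)
print(status)
```

Under this scheme a partial source failure can be reported as a warning and still allow publishing, while any critical issue blocks it.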

Reviewed changes

Copilot reviewed 44 out of 45 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| .github/workflows/etl_semanal.yml | Refactored to parallel job architecture with artifact validation and conditional publishing |
| backend/trend_score.py | Added engine selector supporting legacy pandas and DuckDB implementations |
| backend/trend_score_duckdb.py | New DuckDB-based SQL engine for trend score computation |
| backend/validador.py | Enhanced with Pandera quality checks and severity-based issue routing |
| backend/validate_csv_contract.py | Updated with Pandera integration and configurable validation modes |
| backend/quality/pandera_schemas.py | New module defining dataset schemas and multi-severity quality rules |
| backend/quality/degradation_policy.py | New module implementing source availability degradation matrix |
| backend/config/data_product_contract.py | New contract defining run and dataset manifest structures with validation |
| backend/config/schema_contract_utils.py | New utilities for deterministic schema hashing and SemVer bump recommendations |
| backend/sync_assets.py | Enhanced with latest/legacy prioritization and bridge JSON export integration |
| backend/export_history_json.py | New module generating frontend bridge JSON from history snapshots |
| backend/base_etl.py | Updated with dual-write support for legacy/latest/history destinations |
| backend/config/settings.py | Added write strategy flags and path resolution utilities |
| frontend/lib/services/csv_service.dart | Enhanced with bridge JSON loading and automatic CSV fallback |
| frontend/lib/config/feature_flags.dart | New feature flag system for controlled bridge JSON cutover |
| frontend/lib/screens/home_screen.dart | Added temporal trend view card demonstrating bridge integration |
| tests/* | Comprehensive test coverage for new modules with 133 passing tests |
| docs/* | Updated architecture, contracts, and implementation roadmap documentation |
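
The file summary mentions deterministic schema hashing in backend/config/schema_contract_utils.py. One common way to get such a hash is sketched below, assuming a schema is a mapping of column name to dtype string. The `schema_hash` function name and the canonical JSON encoding are assumptions; the only detail taken from this thread is that the manifest validator expects a 64-character hexadecimal SHA-256 digest.

```python
import hashlib
import json


def schema_hash(columns):
    """Return a SHA-256 hex digest that does not depend on column order.

    columns: mapping of column name -> dtype string (assumed shape).
    """
    # Sort the items and use fixed separators so the same schema always
    # serializes to the same bytes.
    canonical = json.dumps(
        sorted(columns.items()),
        separators=(",", ":"),
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = schema_hash({"name": "string", "stars": "int64"})
second = schema_hash({"stars": "int64", "name": "string"})
changed = schema_hash({"name": "string", "stars": "float64"})
print(first == second, first == changed)
```

Because the digest changes whenever a column name or dtype changes, comparing hashes between runs gives a cheap signal that a version bump may be needed.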

Comment thread docs/dependency_policy.md
@@ -1,47 +1,61 @@
-# Política mínima de dependencias y seguridad
+# Politica de Dependencias y Seguridad

Copilot AI Feb 23, 2026

The BOM (Byte Order Mark) character \ufeff is present at the beginning of several documentation files. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM from these files for better compatibility.

Suggested change
-\ufeff# Politica de Dependencias y Seguridad
+# Politica de Dependencias y Seguridad

Copilot uses AI. Check for mistakes.
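
For reference, removing a UTF-8 BOM from a file can be done with a few lines of Python. This is a sketch under the assumption that the affected files are UTF-8 encoded; `strip_utf8_bom` is a hypothetical helper, not something from this repository, and the demo runs against a temporary file rather than the real docs.

```python
import codecs
import tempfile
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM in place. Return True if one was removed."""
    path = Path(path)
    data = path.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        path.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False


# Demo on a throwaway file that starts with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "example.md"
    doc.write_bytes(codecs.BOM_UTF8 + b"# Title\n")
    first_pass = strip_utf8_bom(doc)   # BOM present, gets removed
    second_pass = strip_utf8_bom(doc)  # nothing left to remove
    cleaned = doc.read_bytes()

print(first_pass, second_pass, cleaned)
```

Working on raw bytes avoids re-encoding the rest of the file, so accented characters elsewhere in the document are left untouched.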
Comment thread docs/data_contract.md
@@ -1,85 +1,113 @@
-# Contrato de datos CSV (Backend Frontend)
+# Contrato de Datos (Backend <-> Frontend)

Copilot AI Feb 23, 2026

The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.

Suggested change
-\ufeff# Contrato de Datos (Backend <-> Frontend)
+# Contrato de Datos (Backend <-> Frontend)

Comment on lines +102 to +202
        return ["dataset manifest debe ser un objeto (dict/mapping)"]

    for field in DATASET_REQUIRED_FIELDS:
        if field not in dataset_manifest:
            errors.append(f"falta campo requerido '{field}'")

    dataset_name = dataset_manifest.get("dataset_logical_name")
    if "dataset_logical_name" in dataset_manifest and not _is_non_empty_string(dataset_name):
        errors.append("'dataset_logical_name' debe ser string no vacio")

    version_semver = dataset_manifest.get("version_semver")
    if "version_semver" in dataset_manifest and not is_valid_semver(version_semver):
        errors.append("'version_semver' no cumple SemVer")

    generated_at_utc = dataset_manifest.get("generated_at_utc")
    if "generated_at_utc" in dataset_manifest and not is_valid_iso_utc(generated_at_utc):
        errors.append("'generated_at_utc' no es ISO-8601 valido con zona horaria")

    source_run_id = dataset_manifest.get("source_run_id")
    if "source_run_id" in dataset_manifest and not _is_non_empty_string(source_run_id):
        errors.append("'source_run_id' debe ser string no vacio")
    if expected_run_id and source_run_id != expected_run_id:
        errors.append("'source_run_id' no coincide con run_id del manifest principal")

    schema_hash = dataset_manifest.get("schema_hash")
    if "schema_hash" in dataset_manifest:
        if not _is_non_empty_string(schema_hash) or _HEX64_RE.fullmatch(schema_hash.strip()) is None:
            errors.append("'schema_hash' debe ser hash sha256 en hexadecimal (64 chars)")

    row_count = dataset_manifest.get("row_count")
    if "row_count" in dataset_manifest:
        if not isinstance(row_count, int):
            errors.append("'row_count' debe ser integer")
        elif row_count < 0:
            errors.append("'row_count' no puede ser negativo")

    quality_status = dataset_manifest.get("quality_status")
    if "quality_status" in dataset_manifest and quality_status not in DATASET_QUALITY_STATUSES:
        errors.append(f"'quality_status' invalido: {quality_status}")

    latest_path = dataset_manifest.get("latest_path")
    if "latest_path" in dataset_manifest and not _is_non_empty_string(latest_path):
        errors.append("'latest_path' debe ser string no vacio")

    history_path = dataset_manifest.get("history_path")
    if "history_path" in dataset_manifest:
        if quality_status == "fail":
            if history_path is not None and not _is_non_empty_string(history_path):
                errors.append("'history_path' debe ser null o string no vacio cuando quality_status=fail")
        elif not _is_non_empty_string(history_path):
            errors.append("'history_path' debe ser string no vacio")

    return errors


def validate_run_manifest(run_manifest: Mapping[str, Any]) -> tuple[bool, list[str]]:
    """Validates minimal structure and rules for a run manifest."""
    errors: list[str] = []

    if not isinstance(run_manifest, Mapping):
        return False, ["run manifest debe ser un objeto (dict/mapping)"]

    for field in RUN_REQUIRED_FIELDS:
        if field not in run_manifest:
            errors.append(f"falta campo requerido '{field}'")

    run_id = run_manifest.get("run_id")
    if "run_id" in run_manifest and not _is_non_empty_string(run_id):
        errors.append("'run_id' debe ser string no vacio")

    generated_at_utc = run_manifest.get("generated_at_utc")
    if "generated_at_utc" in run_manifest and not is_valid_iso_utc(generated_at_utc):
        errors.append("'generated_at_utc' no es ISO-8601 valido con zona horaria")

    for field in ("source_window_start_utc", "source_window_end_utc"):
        value = run_manifest.get(field)
        if field in run_manifest and not is_valid_iso_utc(value):
            errors.append(f"'{field}' no es ISO-8601 valido con zona horaria")

    quality_gate_status = run_manifest.get("quality_gate_status")
    if "quality_gate_status" in run_manifest and quality_gate_status not in QUALITY_GATE_STATUSES:
        errors.append(f"'quality_gate_status' invalido: {quality_gate_status}")

    for field in ("git_sha", "branch"):
        value = run_manifest.get(field)
        if field in run_manifest and not _is_non_empty_string(value):
            errors.append(f"'{field}' debe ser string no vacio")

    datasets = run_manifest.get("datasets")
    if "datasets" in run_manifest:
        if not isinstance(datasets, list):
            errors.append("'datasets' debe ser lista")
        elif not datasets:
            errors.append("'datasets' no puede estar vacio")
        else:
            for index, dataset_manifest in enumerate(datasets):
                dataset_errors = validate_dataset_manifest(
                    dataset_manifest,
                    expected_run_id=run_id if _is_non_empty_string(run_id) else None,
                )
                errors.extend(f"datasets[{index}]: {message}" for message in dataset_errors)

Copilot AI Feb 23, 2026

Multiple error messages in this file are in Spanish (e.g., 'dataset manifest debe ser un objeto', 'falta campo requerido', 'debe ser string no vacio', etc.). According to the coding style guide at docs/coding_style.md, backend modules should use English for comments and docstrings. Error messages should also follow this convention for consistency across the codebase. Consider translating these error messages to English.

Comment thread docs/coding_style.md
@@ -0,0 +1,62 @@
+# Estandar de Estilo del Repositorio

Copilot AI Feb 23, 2026

The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.

Suggested change
-\ufeff# Estandar de Estilo del Repositorio
+# Estandar de Estilo del Repositorio

Comment thread docs/architecture.md
@@ -1,109 +1,96 @@
-# Architecture -- Technology Trend Analysis Platform
+# Arquitectura del Proyecto

Copilot AI Feb 23, 2026

The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.

Suggested change
-\ufeff# Arquitectura del Proyecto
+# Arquitectura del Proyecto

Comment thread README.md
│ Flutter Web Dashboard │
│ 4 views · fl_chart · Export ZIP · Responsive │
└─────────────────────────────────────────────────────┘
# Technology Trend Analysis Platform

Copilot AI Feb 23, 2026

The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.

Suggested change
-\ufeff# Technology Trend Analysis Platform
+# Technology Trend Analysis Platform

@Sam-24-dev Sam-24-dev deleted the feat/backend branch March 16, 2026 01:02

2 participants